Preserve stereochemistry and explicit aromatic '<-->' bonds in Mermaid↔RDKit converters#1
Conversation
There was a problem hiding this comment.
Pull request overview
This PR makes the Mermaid ↔ RDKit converters preserve stereochemistry and aromaticity across round trips. It adds a new <--> aromatic bond token, switches chiral atom-ID suffixes from raw ChiralTag values to RDKit's absolute _CIPCode (R/S), and defers double-bond E/Z and tetrahedral R/S assignment until after Chem.SanitizeMol, applying them through new helper methods that search for the chirality that yields the requested CIP code.
Changes:
- Add
<-->aromatic bond mapping + regex support and propagate aromatic flags to atoms/bonds in bothmoleculeandmarkushparsers. - Postpone stereochemistry assignment until after sanitization via new
_assign_chirality_from_ids/_assign_double_bond_stereohelpers; emit atom IDs using_CIPCoderather thanChiralTag. - Make the system prompt a raw string and add round-trip tests for aromatic, E/Z, and R/S cases.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
molecode/molecule/mermaid_to_rdkit.py |
Adds <--> bond mapping/regex, defers stereo assignment, adds helpers that recover absolute CIP R/S by trying both chiral tags. |
molecode/molecule/rdkit_to_mermaid.py |
Uses _CIPCode instead of ChiralTag to suffix atom IDs with R/S. |
molecode/markush/mermaid_to_rdkit.py |
Mirrors the molecule parser changes for the markush variant. |
molecode/markush/rdkit_to_mermaid.py |
Mirrors the molecule exporter change to use _CIPCode. |
molecode/prompts/molecule_system_prompt.py |
Switches BASE_INSTRUCTION to a raw string to prevent escape-sequence reinterpretation. |
tests/test_molecule_stereochemistry.py |
New round-trip tests covering aromatic <-->, ===|E|/===|Z|, and _R/_S CIP suffixes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Adding <--> here means mol_to_mermaid(..., kekulize=False) now emits syntax that MOLECULE_SYSTEM_PROMPT still tells the model is invalid: molecode/prompts/molecule_system_prompt.py:223-239 says aromatic bonds must use Kekulé form and that <--> cannot be recognized, and the common-errors table at line 547 still marks it wrong. Any LLM workflow using that prompt to generate or edit non-kekulized Mermaid will be instructed to reject/rewrite valid converter output, so the new explicit-aromatic feature is not usable through the prompted path until the prompt is updated.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Extends PR #1's stereochemistry fix (previously molecule + markush only) to the polymer path, which until now silently dropped chirality and double-bond geometry on round-trip. - polymer_to_mermaid: emit absolute CIP R/S (_CIPCode) in atom ids instead of order-dependent CHI_TETRAHEDRAL_CW/CCW (same bug fixed in the molecule converter). - mermaid_to_psmiles: parse _R/_S id suffixes and ===|E|/===|Z| markers, then restore them after sanitize — chirality via brute-forced CIP matching, and double-bond geometry via SetDoubleBondNeighborDirections so SMILES emits / and \. - tests/test_polymer_stereochemistry.py: PLA R/S, cis/trans polybutadiene, combined chiral+E/Z, and plain-polymer regression coverage. Verified: chiral and E/Z PSMILES now round-trip to canonical-identical SMILES, and cis is distinguished from trans. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Motivation
Description
<-->mapping toChem.BondType.AROMATICand updated bond regexes to accept the new aromatic token inmolecode/markush/mermaid_to_rdkit.py,molecode/molecule/mermaid_to_rdkit.py, and related modules.Chem.SanitizeMolis called by collecting stereo records and applying them with new helper methods_assign_chirality_from_idsand_assign_double_bond_stereowhich restore CIP R/S and E/Z configurations.rdkit_to_mermaidconverters to use RDKit's_CIPCodeproperty (absoluteR/S) instead of rawChiralTagvalues so exported IDs are chemically meaningful and stable.tests/test_molecule_stereochemistry.pywith round-trip tests for aromatic bonds, E/Z double bonds, and R/S tetrahedral stereochemistry.Testing
pytest tests/test_molecule_stereochemistry.pywhich exercises aromatic non-kekulized rings, E/Z double-bond stereochemistry, and R/S tetrahedral stereochemistry, and all tests passed.<-->aromatic token,===|E|/===|Z|double-bond stereo markers, and_R/_SCIP atom-id suffixes, and that round-tripped molecules match canonical isomeric SMILES.Codex Task